Welcome! Meet the teaching team


Chief examiner: Jack Jewson

Communication: All questions need to be communicated through the Moodle Discussion forum. I will not respond to emails about module content. Any private matters can be addressed to Jack.Jewson@monash.edu.

Tutors:

  • Floyd: 2nd year PhD student working on election auditing
  • Harriet: 3rd year PhD student working on visualisation of uncertainty
  • Jayani: 3rd year PhD student working on methods for understanding how non-linear dimension reduction warps your data
  • Maliny: MBAt graduate, aspiring to be a PhD student in 2026

Organisation


  • Thursdays 8am-10am - Lecture: Focus on Methodology
  • Wednesday 1pm-2pm - Online Workshop: Focus on Implementation
  • Mondays and Tuesdays - Tutorials: Your chance to implement the methods for yourselves
  • Thursday 10am-11am - Consultation: Chance to ask me for clarifications

What this course is about


  • select and develop appropriate models for clustering, prediction or classification.
  • estimate and simulate from a variety of statistical models.
  • measure the uncertainty of a prediction or classification using resampling methods.
  • apply business analytic tools to produce innovative solutions in finance, marketing, economics and related areas.
  • manage very large data sets in a modern software environment.
  • explain and interpret the analyses undertaken clearly and effectively.

Assessment


  • Assignment 1: 10% (Due 30th March)
  • Assignment 2: 10% (Due 20th April)
  • Project: 20% (Due 25th May), ETC5250 students will present their projects in the Week 12 class
  • Final exam: 60%

Weekly learning quizzes: Unassessed. Posted on Fridays; solutions will be discussed in the workshop

Generative AI


  • Can be used for Assignments 1 and 2 and the Project, as long as you acknowledge it correctly
  • GenAI tools are Large Language Models: they are trained to produce text, not to do or understand statistical concepts and mathematics
  • You may find it helps you in coding exercises
  • Use this as a tool to help improve your coding
  • But make sure you check and understand it first

Reading


  • As well as a tutorial each week, I will post a small amount of required reading
  • This will augment what I tell you in lectures
  • Everything you need to pass the course will be in the lectures, but the reading will give you another perspective and deepen your understanding

How to do well


  • Keep up-to-date with content:
    • participate in the lecture each week
    • attend tutorials
    • complete weekly learning quiz to check your understanding
    • read the relevant sections of the resource material
    • run the code from lectures in the qmd files
  • Begin assessments early; when they are posted, map out a plan to complete them on time
  • Ask questions

Looking forward


Machine learning is a big, big area. This semester covers just the tip of the iceberg; there are many more interesting methods and problems than we can cover. Take this as a challenge to get you started, and become hungry to learn more!


  • This class is an Introduction to Machine Learning
  • I will also lecture ETC3555/5555 Statistical Machine Learning next semester, where we will have a chance to dive deeper into some topics

In today's class


  • Look at different types of learning problems
  • Consider what data looks like, both mathematically and on your computer
  • Discuss some basic mathematical concepts that will be useful for the rest of this course
  • Establish a framework for good machine learning modelling


We will continually refer back to the topics introduced in this class throughout the semester

Types of problems

Framing the problem

  1. Supervised classification: categorical \(y_i\in\{1, \ldots, K\}\) is available for all \(x_i\in\mathbb{R}^p\)
  • Your task is to learn the relationship between predictors/features \(x\) and response \(y\)
  • So that you can accurately predict a future unknown \(y^{\prime}\) from a known \(x^{\prime}\)

Framing the problem

  2. Unsupervised learning: \(y_i\) unavailable for all \(x_i\)
  • Your task is to learn what the \(y_i\)’s could have been for each \(x_i\)
  • Uncover some latent structure (e.g. clustering)

What type of problem is this? (1/3)

  • Food servers’ tips in restaurants may be influenced by many factors e.g. the nature of the restaurant, size of the party, and table locations in the restaurant.

  • Restaurant manager cannot directly control the size of the tips

  • Instead they need to know which factors matter when they assign tables to food servers.

  • For the sake of staff morale, they usually want to avoid unfair treatment of the servers, for whom tips (in the U.S.) are a major component of pay.

  • In one restaurant, a food server recorded data on all customers they served during an interval of two and a half months in early 1990.

  • Each record includes a day and time, and taken together, they show the server’s work schedule.

What is \(y\)? What is \(x\)?

What type of problem is this? (2/3)


  • Every person monitored their email for a week and recorded information about each email message; for example, whether it was spam, and what day of the week and time of day the email arrived.
  • We want to use this information to build a spam filter, a classifier that will catch spam with high probability but will never classify good email as spam.

What is \(y\)? What is \(x\)?

What type of problem is this? (3/3)

A health insurance company collected the following information about households:

  • Total number of doctor visits per year
  • Total household size
  • Total number of hospital visits per year
  • Average age of household members
  • Total number of gym memberships
  • Use of physiotherapy and chiropractic services
  • Total number of optometrist visits

The health insurance company wants to provide a small range of products, containing different bundles of services and for different levels of cover, to market to customers.

What is \(y\)? What is \(x\)?

What type of problem is this? Summary


  • Problem 1 was a regression problem. We wanted to predict a continuous response, i.e. the size of the tip in USD, from data
  • Problem 2 was a classification problem. We wanted to predict a discrete response, i.e. whether an email was spam or not, from data
  • Problem 3 was an unsupervised clustering problem. We wanted to identify different attributes of different sections of the population in order to tailor different products to them

Math and computing foundations

Predictors: math

\(n\) number of observations or sample points (different tables in the restaurant)

\(p\) number of variables or the dimension of the data (the number of things we measured about each table)

We can represent this data matrix as:

\[\begin{align*} {\mathbf X}_{n\times p}= ({\mathbf x}_1 ~ {\mathbf x}_2 ~ \dots ~ {\mathbf x}_p) = \left(\begin{array}{cccc} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{array} \right) \end{align*}\]

Where \(x_{i1}\) might be whether it was lunch or dinner, and \(x_{i2}\) the size of the party, etc.

This is also considered the matrix of predictors, or explanatory or independent variables, features, attributes, or input.

We will define \(\mathcal{X}\subset \mathbb{R}^p\) as the set in which the predictors live.

Predictors: computing


library(mvtnorm)
vc <- matrix(c(1, 0.5, 0.2, 
               0.5, 1, -0.3, 
               0.2, -0.3, 1), 
             ncol=3, byrow=TRUE)
set.seed(449)
x <- rmvnorm(5, 
             mean = c(-0.2, 0, 0.3), 
             sigma = vc)
x
      [,1]  [,2]  [,3]
[1,] -0.42  1.01 -0.64
[2,] -0.83 -0.99 -0.37
[3,] -0.98  0.66 -0.13
[4,] -0.13 -0.25  0.57
[5,] -0.37  0.55  1.37

What’s the dimension of the data?

library(dplyr)
library(palmerpenguins)
p_tidy <- penguins |>
  select(species, bill_length_mm:body_mass_g) |>
  rename(bl=bill_length_mm,
         bd=bill_depth_mm,
         fl=flipper_length_mm,
         bm=body_mass_g) 
p_tidy |> slice_head(n=10)
# A tibble: 10 × 5
   species    bl    bd    fl    bm
   <fct>   <dbl> <dbl> <int> <int>
 1 Adelie   39.1  18.7   181  3750
 2 Adelie   39.5  17.4   186  3800
 3 Adelie   40.3  18     195  3250
 4 Adelie   NA    NA      NA    NA
 5 Adelie   36.7  19.3   193  3450
 6 Adelie   39.3  20.6   190  3650
 7 Adelie   38.9  17.8   181  3625
 8 Adelie   39.2  19.6   195  4675
 9 Adelie   34.1  18.1   193  3475
10 Adelie   42    20.2   190  4250

What’s the dimension of the data?

R’s pipe

A reminder for anyone who is not familiar with R’s pipe

library(dplyr)
library(palmerpenguins)
p_tidy1 <- penguins |>
  select(species, bill_length_mm:body_mass_g) |>
  rename(bl=bill_length_mm,
         bd=bill_depth_mm,
         fl=flipper_length_mm,
         bm=body_mass_g)
p_tidy1 |> slice_head(n=10)
# A tibble: 10 × 5
   species    bl    bd    fl    bm
   <fct>   <dbl> <dbl> <int> <int>
 1 Adelie   39.1  18.7   181  3750
 2 Adelie   39.5  17.4   186  3800
 3 Adelie   40.3  18     195  3250
 4 Adelie   NA    NA      NA    NA
 5 Adelie   36.7  19.3   193  3450
 6 Adelie   39.3  20.6   190  3650
 7 Adelie   38.9  17.8   181  3625
 8 Adelie   39.2  19.6   195  4675
 9 Adelie   34.1  18.1   193  3475
10 Adelie   42    20.2   190  4250
library(dplyr)
library(palmerpenguins)
p_tidy1 <- select(penguins, species, bill_length_mm:body_mass_g)
p_tidy2 <- rename(p_tidy1, 
         bl=bill_length_mm,
         bd=bill_depth_mm,
         fl=flipper_length_mm,
         bm=body_mass_g)
slice_head(p_tidy2, n=10)
# A tibble: 10 × 5
   species    bl    bd    fl    bm
   <fct>   <dbl> <dbl> <int> <int>
 1 Adelie   39.1  18.7   181  3750
 2 Adelie   39.5  17.4   186  3800
 3 Adelie   40.3  18     195  3250
 4 Adelie   NA    NA      NA    NA
 5 Adelie   36.7  19.3   193  3450
 6 Adelie   39.3  20.6   190  3650
 7 Adelie   38.9  17.8   181  3625
 8 Adelie   39.2  19.6   195  4675
 9 Adelie   34.1  18.1   193  3475
10 Adelie   42    20.2   190  4250

The pipe will tidy up your code if you are going to be applying multiple steps to the same data

Observations and variables: math


The \(i^{th}\) observation is denoted as

\[\begin{align*} x_i = \left(\begin{array}{cccc} x_{i1} & x_{i2} & \dots & x_{ip} \\ \end{array} \right) \end{align*}\]

The \(j^{th}\) variable is denoted as

\[\begin{align*} x_j = \left(\begin{array}{c} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \\ \end{array} \right) \end{align*}\]

Observations and variables: computing


Observations - rows

x[2,]
[1] -0.83 -0.99 -0.37


p_tidy |> slice_sample()
# A tibble: 1 × 5
  species      bl    bd    fl    bm
  <fct>     <dbl> <dbl> <int> <int>
1 Chinstrap  51.5  18.7   187  3250

Variables - columns

x[,1]
[1] -0.42 -0.83 -0.98 -0.13 -0.37


p_tidy |> pull(fl)
  [1] 181 186 195  NA 193 190 181 195 193 190 186 180 182
 [14] 191 198 185 195 197 184 194 174 180 189 185 180 187
 [27] 183 187 172 180 178 178 188 184 195 196 190 180 181
 [40] 184 182 195 186 196 185 190 182 179 190 191 186 188
 [53] 190 200 187 191 186 193 181 194 185 195 185 192 184
 [66] 192 195 188 190 198 190 190 196 197 190 195 191 184
 [79] 187 195 189 196 187 193 191 194 190 189 189 190 202
 [92] 205 185 186 187 208 190 196 178 192 192 203 183 190
[105] 193 184 199 190 181 197 198 191 193 197 191 196 188
[118] 199 189 189 187 198 176 202 186 199 191 195 191 210
[131] 190 197 193 199 187 190 191 200 185 193 193 187 188
[144] 190 192 185 190 184 195 193 187 201 211 230 210 218
[157] 215 210 211 219 209 215 214 216 214 213 210 217 210
[170] 221 209 222 218 215 213 215 215 215 216 215 210 220
[183] 222 209 207 230 220 220 213 219 208 208 208 225 210
[196] 216 222 217 210 225 213 215 210 220 210 225 217 220
[209] 208 220 208 224 208 221 214 231 219 230 214 229 220
[222] 223 216 221 221 217 216 230 209 220 215 223 212 221
[235] 212 224 212 228 218 218 212 230 218 228 212 224 214
[248] 226 216 222 203 225 219 228 215 228 216 215 210 219
[261] 208 209 216 229 213 230 217 230 217 222 214  NA 215
[274] 222 212 213 192 196 193 188 197 198 178 197 195 198
[287] 193 194 185 201 190 201 197 181 190 195 181 191 187
[300] 193 195 197 200 200 191 205 187 201 187 203 195 199
[313] 195 210 192 205 210 187 196 196 196 201 190 212 187
[326] 198 199 201 193 203 187 197 191 203 202 194 206 189
[339] 195 207 202 193 210 198

Response: math

The response variable (or target variable, output, outcome measurement), when it exists, is denoted as:

\[\begin{align*} {\mathbf y} = \left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{array} \right) \end{align*}\]

We will define \(\mathcal{Y}\) as the set in which the responses live.

The whole dataset can also be written as

\[\mathcal{D} = \{(y_i, x_i)\}_{i = 1}^n = \{(y_1, x_1), (y_2, x_2), \dots, (y_n, x_n)\} \in \{\mathcal{X}\times \mathcal{Y}\}^n\]

where \(x_i\) is a vector with \(p\) elements.

Response: computing


species is the response variable, and it is categorical.

set.seed(424)
p_tidy |> slice_sample(n=10)
# A tibble: 10 × 5
   species      bl    bd    fl    bm
   <fct>     <dbl> <dbl> <int> <int>
 1 Gentoo     47.4  14.6   212  4725
 2 Gentoo     46.2  14.5   209  4800
 3 Chinstrap  45.6  19.4   194  3525
 4 Adelie     38.8  17.6   191  3275
 5 Gentoo     50    15.3   220  5550
 6 Chinstrap  46    18.9   195  4150
 7 Adelie     38.9  18.8   190  3600
 8 Gentoo     46.5  14.5   213  4400
 9 Gentoo     46.8  14.3   215  4850
10 Gentoo     45.7  13.9   214  4400

A binary matrix format is sometimes useful.

set.seed(424)
model.matrix(~ 0 + species, data = p_tidy) |>
  as_tibble() |>
  slice_sample(n=10)
# A tibble: 10 × 3
   speciesAdelie speciesChinstrap speciesGentoo
           <dbl>            <dbl>         <dbl>
 1             0                0             1
 2             0                0             1
 3             0                1             0
 4             1                0             0
 5             0                0             1
 6             0                1             0
 7             1                0             0
 8             0                0             1
 9             0                0             1
10             0                0             1


This is often refferred to as one-hot-encoding or creating dummy variables

Linear algebra


A transposed data matrix is denoted as \[\begin{align*} {\mathbf X}^\top = \left(\begin{array}{cccc} x_{11} & x_{21} & \dots & x_{n1} \\ x_{12} & x_{22} & \dots & x_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \dots & x_{np} \end{array} \right)_{p\times n} \end{align*}\]

x
      [,1]  [,2]  [,3]
[1,] -0.42  1.01 -0.64
[2,] -0.83 -0.99 -0.37
[3,] -0.98  0.66 -0.13
[4,] -0.13 -0.25  0.57
[5,] -0.37  0.55  1.37
t(x)
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] -0.42 -0.83 -0.98 -0.13 -0.37
[2,]  1.01 -0.99  0.66 -0.25  0.55
[3,] -0.64 -0.37 -0.13  0.57  1.37

Matrix multiplication: math


\[\begin{align*} {\mathbf A}_{2\times 3} = \left(\begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \end{array} \right) \end{align*}\]

\[\begin{align*} {\mathbf B}_{3\times 4} = \left(\begin{array}{cccc} b_{11} & b_{12} & b_{13} & b_{14}\\ b_{21} & b_{22} & b_{23} & b_{24}\\ b_{31} & b_{32} & b_{33} & b_{34}\\ \end{array} \right) \end{align*}\]

then

\[\begin{align*} {\mathbf A}{\mathbf B}_{2\times 4} = \left(\begin{array}{cccc} \sum_{j=1}^3 a_{1j}b_{j1} & \sum_{j=1}^3 a_{1j}b_{j2} & \sum_{j=1}^3 a_{1j}b_{j3} & \sum_{j=1}^3 a_{1j}b_{j4}\\ \sum_{j=1}^3 a_{2j}b_{j1} & \sum_{j=1}^3 a_{2j}b_{j2} & \sum_{j=1}^3 a_{2j}b_{j3} & \sum_{j=1}^3 a_{2j}b_{j4} \end{array} \right) \end{align*}\]

Pour the rows into the columns.


Note: You can’t do \({\mathbf B}{\mathbf A}\)! The inner dimensions don’t match: \({\mathbf B}\) has 4 columns but \({\mathbf A}\) has 2 rows.

Matrix multiplication: computing


x
      [,1]  [,2]  [,3]
[1,] -0.42  1.01 -0.64
[2,] -0.83 -0.99 -0.37
[3,] -0.98  0.66 -0.13
[4,] -0.13 -0.25  0.57
[5,] -0.37  0.55  1.37
proj <- matrix(c(1/sqrt(2), 1/sqrt(2), 0, 
                 0, 0, 1), ncol=2, byrow=FALSE)
proj
     [,1] [,2]
[1,] 0.71    0
[2,] 0.71    0
[3,] 0.00    1
x %*% proj
      [,1]  [,2]
[1,]  0.42 -0.64
[2,] -1.28 -0.37
[3,] -0.23 -0.13
[4,] -0.27  0.57
[5,]  0.13  1.37

Try this:

t(x) %*% proj

It produces an error because the matrices are non-conformable: t(x) is \(3\times 5\) and proj is \(3\times 2\), so the inner dimensions (5 and 3) don’t match

Error in t(x) %*% proj : non-conformable arguments



Notice: %*% uses a * so it is NOT the tidyverse pipe.

Identity matrix


\[\begin{align*} I = \left(\begin{array}{cccc} 1 & 0 & \dots & 0 \\ 0 & 1 & & \vdots \\ \vdots & & \ddots & 0\\ 0 & 0 & & 1 \\ \end{array}\right)_{p\times p} \end{align*}\]

diag(1, 8, 8)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    0    0    0    0    0    0    0
[2,]    0    1    0    0    0    0    0    0
[3,]    0    0    1    0    0    0    0    0
[4,]    0    0    0    1    0    0    0    0
[5,]    0    0    0    0    1    0    0    0
[6,]    0    0    0    0    0    1    0    0
[7,]    0    0    0    0    0    0    1    0
[8,]    0    0    0    0    0    0    0    1


This is the matrix equivalent of 1


\(AI = A\) and \(IA = A\) (as long as \(A\) was conformable with these matrix multiplications)
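
We can check this in R with the \(5\times 3\) matrix x defined earlier (a quick sketch; any conformable matrix behaves the same way):

# post-multiplying by I_3 leaves x unchanged: x I = x
all.equal(x %*% diag(1, 3, 3), x)
[1] TRUE
# pre-multiplying by I_5 also leaves x unchanged: I x = x
all.equal(diag(1, 5, 5) %*% x, x)
[1] TRUE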

Inverting a matrix: math

Suppose that \({\mathbf A}\) is square

\[\begin{align*} {\mathbf A}_{2\times 2} = \left(\begin{array}{cc} a & b \\ c & d \\ \end{array} \right) \end{align*}\]

then the inverse is (if \(ad-bc \neq 0\))

\[\begin{align*} {\mathbf A}^{-1}_{2\times 2} = \frac{1}{ad-bc} \left(\begin{array}{cc} d & -b \\ -c & a \\ \end{array} \right) \end{align*}\]

and \({\mathbf A}{\mathbf A}^{-1} = I\) where

\[\begin{align*} {\mathbf I}_{2\times 2} = \left(\begin{array}{cc} 1 & 0 \\ 0 & 1 \\ \end{array} \right) \end{align*}\]

If \(AB=I\), then \(B=A^{-1}\).

vc
     [,1] [,2] [,3]
[1,]  1.0  0.5  0.2
[2,]  0.5  1.0 -0.3
[3,]  0.2 -0.3  1.0
vc_i <- solve(vc)
vc_i
      [,1]  [,2]  [,3]
[1,]  1.62 -1.00 -0.63
[2,] -1.00  1.71  0.71
[3,] -0.63  0.71  1.34
vc %*% vc_i
         [,1]     [,2]    [,3]
[1,]  1.0e+00  8.3e-17 0.0e+00
[2,]  2.8e-17  1.0e+00 5.6e-17
[3,] -1.1e-16 -1.1e-16 1.0e+00

We see issues with numerical precision here: the off-diagonal entries are around \(10^{-16}\) rather than exactly zero. This is something you should be aware of, but do not need to worry about

Projections

\(d\) \((\leq p)\) is used to denote the number of variables in a lower dimensional space.


A projection takes our data from the original \(p\) dimensional space to the lower \(d\) dimensional space

\(A\) is a \(p\times d\) orthonormal basis if \(A^\top A=I_d\).

i.e. \(\sum_{i=1}^pA_{ij}^2 = 1\) for \(j = 1, \ldots, d\) and \(\sum_{i=1}^pA_{ij}A_{ik} = 0\) for \(k \neq j\)


The projection of \({\mathbf x_i}\in\mathbb{R}^p\) onto \(A\) is \(A^\top{\mathbf x}_i\in\mathbb{R}^d\).

proj
     [,1] [,2]
[1,] 0.71    0
[2,] 0.71    0
[3,] 0.00    1
sum(proj[,1]^2)
[1] 1
sum(proj[,2]^2)
[1] 1
sum(proj[,1]*proj[,2])
[1] 0
t(proj)%*%proj
     [,1] [,2]
[1,]    1    0
[2,]    0    1

proj is an orthonormal projection matrix.

Conceptual framework

Accuracy vs interpretability

Why are we learning from data?

Predictive accuracy

The primary purpose is to be able to predict \(\hat{y}\) for new data \(x^{\prime}\). And we’d like to do that well! That is, accurately.

From XKCD

Interpretability

Almost equally important is that we want to understand the relationship between \({\mathbf x}\) and \(y\).

Simpler models are easier to understand.

The simpler model that is (almost) as accurate is the one we choose, always.

From Interpretable Machine Learning

Training vs test splits

When data are reused for multiple tasks, instead of carefully spent from the finite data budget, certain risks increase, such as the risk of accentuating bias or compounding effects from methodological errors. Julia Silge

Therefore we split the data and conduct different tasks on different subsets

  • Training set \(\approx 80\%\): Used to fit the model. This might also be broken into a validation set(s) for hyperparameter tuning and/or frequent assessment of fit.
  • Test set \(\approx 20\%\): Used purely to assess the final model's performance on future data.


DO NOT only evaluate your model on the training set.

The model was estimated on that training set so it is designed to predict that set well.

Therefore this will give you a BIASED estimate of its performance.


In particular a more complex model will always APPEAR to predict the training set better

Training vs test splits

From ‘Pattern Recognition and Machine Learning’ by Christopher Bishop

Training vs test splits


library(tidymodels)
d_bal <- tibble(y=c(rep("A", 6), rep("B", 6)),
                x=c(runif(12)))
d_bal$y
 [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"
set.seed(130)
d_bal_split <- initial_split(d_bal, prop = 0.70)
training(d_bal_split)$y
[1] "A" "A" "B" "A" "B" "A" "B" "A"
testing(d_bal_split)$y
[1] "A" "B" "B" "B"


Setting the seed ensures that if we reproduce this random split, we get the same training and testing sets.

d_unb <- tibble(y=c(rep("A", 2), rep("B", 10)),
                x=c(runif(12)))
d_unb$y
 [1] "A" "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
set.seed(132)
d_unb_split <- initial_split(d_unb, prop = 0.70)
training(d_unb_split)$y
[1] "B" "B" "A" "B" "B" "A" "B" "B"
testing(d_unb_split)$y
[1] "B" "B" "B" "B"


Always stratify splitting by sub-groups, especially response variable classes.

d_unb_strata <- initial_split(d_unb, prop = 0.70, strata=y)
training(d_unb_strata)$y
[1] "A" "B" "B" "B" "B" "B" "B" "B"
testing(d_unb_strata)$y
[1] "A" "B" "B" "B"


This ensures that your training set adequately resembles your testing set.

Measuring accuracy for categorical response


For a supervised learning problem, our machine learning model is a function, learned from the training data, that produces a prediction for a new observation

Let \(\widehat{y}:\mathcal{X}\mapsto\mathcal{Y}\) be the model estimated from training data, \(\{(y_i, {\mathbf x}_i)\}_{i = 1}^n\).

Consider a classification problem i.e. \(\mathcal{Y} = \{1, \ldots, K\}\), a number for each class.

We can evaluate this model using the error rate (fraction of misclassifications)

We can look at this in the training data, \(\{(y_i, {\mathbf x}_i)\}_{i = 1}^n\) to get the Training Error Rate

\[\text{Error rate} = \frac{1}{n}\sum_{i=1}^n I(y_i \ne \widehat{y}({\mathbf x}_i))\]

But, a better estimate of future accuracy is obtained using test data to get the Test Error Rate.


Training error will usually be smaller than test error. When it is much smaller, this indicates that the model is fitted too closely to the training data to be accurate on future data (over-fitted).
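
As a sketch with hypothetical label vectors, the error rate is just the proportion of mismatches:

# hypothetical true and predicted labels
y_true <- c(1, 2, 1, 3, 2, 1)
y_pred <- c(1, 2, 2, 3, 2, 3)
mean(y_true != y_pred)  # 2 of 6 misclassified
[1] 0.3333333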

Confusion (misclassification) matrix


You can visualise this classification performance using a confusion matrix

              predicted
              1   0
true   1      a   b
       0      c   d

Consider 1=positive (P), 0=negative (N).

  • True positive (TP): a
  • True negative (TN): d
  • False positive (FP): c (Type I error)
  • False negative (FN): b (Type II error)
  • Sensitivity, recall, hit rate, or true positive rate (TPR): TP/P = a/(a+b)
  • Specificity, selectivity or true negative rate (TNR): TN/N = d/(c+d)
  • Prevalence: P/(P+N) = (a+b)/(a+b+c+d)
  • Accuracy: (TP+TN)/(P+N) = (a+d)/(a+b+c+d)
  • Balanced accuracy: (TPR + TNR)/2 (the average of the per-class accuracies)
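
A minimal sketch computing these quantities from hypothetical cell counts a, b, c, d:

# hypothetical confusion matrix cell counts
a <- 40; b <- 10; c <- 5; d <- 45
tpr <- a / (a + b)                 # sensitivity / recall
tnr <- d / (c + d)                 # specificity
acc <- (a + d) / (a + b + c + d)   # accuracy
c(tpr = tpr, tnr = tnr, acc = acc, bal_acc = (tpr + tnr) / 2)
    tpr     tnr     acc bal_acc 
   0.80    0.90    0.85    0.85 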

Confusion (misclassification) matrix: computing

Two classes

#| label: confusion-matrix-tidy
# Write out the confusion matrix in standard form
cm <- a2 |> count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred==y]/sum(n)) 
cm |>
  pivot_wider(names_from = pred, 
              values_from = n) |>
  select(y, bilby, quokka, cl_acc)
# A tibble: 2 × 4
# Groups:   y [2]
  y      bilby quokka cl_acc
  <fct>  <int>  <int>  <dbl>
1 bilby      9      3  0.75 
2 quokka     5     10  0.667
accuracy(a2, y, pred) |> pull(.estimate)
[1] 0.7
bal_accuracy(a2, y, pred) |> pull(.estimate)
[1] 0.71
sens(a2, y, pred) |> pull(.estimate)
[1] 0.75
specificity(a2, y, pred) |> pull(.estimate)
[1] 0.67

More than two classes

# Write out the confusion matrix in standard form
cm3 <- a3 |> count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred==y]/sum(n)) 
cm3 |>
  pivot_wider(names_from = pred, 
              values_from = n, values_fill=0) |>
  select(y, bilby, quokka, numbat, cl_acc)
# A tibble: 3 × 5
# Groups:   y [3]
  y      bilby quokka numbat cl_acc
  <fct>  <int>  <int>  <int>  <dbl>
1 bilby      9      3      0  0.75 
2 numbat     0      2      6  0.75 
3 quokka     5     10      0  0.667
accuracy(a3, y, pred) |> pull(.estimate)
[1] 0.71
bal_accuracy(a3, y, pred) |> pull(.estimate)
[1] 0.78

Predicting Probabilities


I lied before!

In classification, the majority of machine learning models will not be functions that directly produce a prediction \(\widehat{y}:\mathcal{X}\mapsto\{1, \ldots, K\}\)


We will see in this class that it is much better to estimate probabilities of being in each class.

Let these probabilities be \(\widehat{p}:\mathcal{X}\mapsto (0, 1)^K\) with \(\sum_{k=1}^K\widehat{p}(x_i)_k = 1\) where \(\widehat{p}(x_i)_k = \mathbb{P}(y = k\mid x_i)\)


We can always turn these probabilities into a class prediction, i.e. \(\widehat{y}(x_i) = \arg\max_k \widehat{p}(x_i)_k\)

But the probabilities themselves contain much more information than just a class label

And it is easier to train a model to learn them

Probabilities to prediction

While it is easier, and more informative, to model probabilities, eventually a decision/prediction must be made.

Consider binary classification, i.e. \(\mathcal{Y} = \{0, 1\}\)

Then a decision/prediction rule would be to set \(\hat{y}(x_i) = 1\) if \(\hat{p}(x_i) > \alpha\)


One logical value is \(\alpha = 0.5\)

However if the probability that a relative has cancer is 0.45, do you want the doctor to predict they don’t have cancer??


The value of \(\alpha\) is problem specific and depends on the consequences of misclassification.

It is not our job as data scientists to decide on \(\alpha\); we model the probabilities and let a domain expert turn this into a decision.
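
A sketch of turning hypothetical class-1 probabilities into binary predictions at a chosen \(\alpha\):

p_hat <- c(0.10, 0.45, 0.80, 0.62)  # hypothetical P(y = 1 | x)
as.integer(p_hat > 0.5)             # default threshold
[1] 0 0 1 1
as.integer(p_hat > 0.2)             # cautious threshold, e.g. a cancer screen
[1] 0 1 1 1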

Receiver Operator Curves (ROC)

Therefore we need to be able to evaluate our models for any value of \(\alpha\)

Evaluate the TPR and FPR for different values of \(\alpha\)

If \(\alpha\) = 1, \(\hat{p}(x_i)\) can never be greater than 1

TPR = 0, FPR = 0


If \(\alpha\) = 0, \(\hat{p}(x_i)\) is always greater than 0

TPR = 1, FPR = 1


For a good classifier, as we decrease \(\alpha\) the TPR should go up faster than the FPR

From Wikipedia

The ROC curve allows you to evaluate the accuracy for all values of \(\alpha\)
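
A sketch of how the curve is traced out, using hypothetical labels and probabilities over a grid of \(\alpha\) values:

y <- c(0, 0, 1, 1, 1, 0, 1, 0)                   # hypothetical true labels
p <- c(0.2, 0.4, 0.8, 0.7, 0.55, 0.3, 0.9, 0.6)  # hypothetical P(y = 1 | x)
alphas <- seq(0, 1, by = 0.1)
tpr <- sapply(alphas, function(a) mean(p[y == 1] > a))  # true positive rate
fpr <- sapply(alphas, function(a) mean(p[y == 0] > a))  # false positive rate
plot(fpr, tpr, type = "b")  # rough ROC curve, from (1,1) down to (0,0)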

Receiver Operator Curves (ROC)


Therefore we need to be able to evaluate our models for any value of \(\alpha\)

From Wikipedia
a2 |> slice_head(n=3)
# A tibble: 3 × 4
  y     pred  bilby quokka
  <fct> <fct> <dbl>  <dbl>
1 bilby bilby   0.9    0.1
2 bilby bilby   0.8    0.2
3 bilby bilby   0.9    0.1
roc_curve(a2, y, bilby) |>
  autoplot()

Parametric vs non-parametric


We will look at both parametric and non-parametric machine learning models (for \(\hat{y}\) or \(\hat{p}\))

Parametric methods

  • Assume that the model takes a specific form
  • e.g. \(\hat{y}(x) = x^\top\theta\) (linear model) or \(\hat{p}(x) = \frac{1}{1+\exp(-x^\top\theta)}\) (logistic regression)
  • Fitting then is a matter of estimating the parameters of the model, i.e. \(\theta\)
  • Generally considered to be less flexible
  • If assumptions are wrong, performance likely to be poor

Non-parametric methods

  • No specific assumptions
  • Allow the data to specify the model form, without being too rough or wiggly
  • More flexible
  • Generally needs more observations, and not too many variables
  • Easier to over-fit
  • e.g. k-Nearest Neighbours or Neural Networks
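
To make the contrast concrete, here is a sketch on simulated data, fitting a parametric linear model and a non-parametric loess smoother (both in base R; variable names are illustrative):

set.seed(1)
x1 <- runif(100, -2, 2)
y1 <- sin(2 * x1) + rnorm(100, sd = 0.3)  # the truth is non-linear
fit_lm <- lm(y1 ~ x1)      # parametric: assumes a straight line
fit_np <- loess(y1 ~ x1)   # non-parametric: lets the data pick the shape
plot(x1, y1)
ox <- order(x1)
lines(x1[ox], fitted(fit_lm)[ox], col = "red")   # misses the curvature (bias)
lines(x1[ox], fitted(fit_np)[ox], col = "blue")  # tracks the wiggle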

Parametric vs non-parametric

From XKCD

Parametric vs non-parametric

Black line is true boundary.

Parametric vs non-parametric

Reducible vs irreducible error

If the model form is incorrect, the error (solid circles) may arise from the wrong shape, and is thus reducible.

Irreducible error means that we have the right model and the mistakes (solid circles) are random noise.

Flexible vs inflexible


Parametric models tend to be less flexible but non-parametric models can be flexible or less flexible depending on parameter settings.

Bias vs variance

A framework for thinking about how much flexibility we need

Bias is the error that is introduced by modeling a complicated problem using a simpler model.

  • For example, linear regression assumes a linear relationship and perhaps the relationship is not exactly linear.
  • In general, the more flexible a method is, the less bias it will have because it can fit a complex shape better.

Variance refers to how much your estimate would change if you had different training data. It measures how much your model depends on the particular data you have, to the neglect of future data.

  • In general, the more flexible a method is, the more variance it has.
  • The size of the training data can impact the variance.
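
A sketch of the variance side: fit the same flexible smoother to two independently simulated training sets and compare predictions at common points (the disagreement is the variance; the simulation setup is illustrative):

set.seed(2)
sim <- function(n = 50) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))
}
grid <- data.frame(x = seq(-1.5, 1.5, by = 0.5))
f1 <- loess(y ~ x, data = sim(), span = 0.3)  # very flexible fit
f2 <- loess(y ~ x, data = sim(), span = 0.3)  # same model, new training data
cbind(predict(f1, grid), predict(f2, grid))   # predictions disagree noticeably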

Bias

High bias: when you impose too many assumptions with a parametric model, or use an inadequate non-parametric model, such as not letting an algorithm converge fully.

Low bias: when the model closely captures the true shape, with a well-specified parametric model or a flexible model.

Variance

Low variance: the fit is virtually identical even though a different training sample was used.

High variance: you get a very different model if a different training set is used.

Bias-variance tradeoff


Figures 2.15 and 2.16 from ISLR

Goal: Without knowing what the true structure is, fit the signal and ignore the noise. Be flexible but not too flexible.

Trade-off between accuracy and interpretability

Diagnosing the fit

Compute and examine the usual diagnostics; some methods have more than others

  • fit statistics: accuracy, sensitivity, specificity
  • errors/misclassifications
  • variable importance
  • plot residuals, examine the misclassifications
  • check that the test set is similar to the training set

Go beyond … Look at the data and the model together!

Wickham et al (2015) Removing the Blindfold

Training - plusses; Test - dots

Feature engineering

Sometimes, to get a better fit (reduce the bias), you need a more complex model

Other times you can get the same results by transforming the original variables (See tidymodels steps.)

  • scaling, centering, sphering (step_pca)
  • log or square root or box-cox transformation (step_log)
  • ratio of values (step_ratio)
  • polynomials or splines: \(x_1^2, x_1^3\) (step_ns)
  • dummy variables: categorical predictors expanded into multiple new binary variables (step_dummy)
  • Convolutional Neural Networks: neural networks but with pre-processing of images to combine values of neighbouring pixels; flattening of images

Some models will even learn these transformations for you
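
As a sketch, a few of these steps chained together with tidymodels recipes on the penguins data from earlier (the step functions are real; this particular pipeline is just illustrative):

library(tidymodels)
rec <- recipe(species ~ ., data = p_tidy) |>
  step_impute_mean(all_numeric_predictors()) |>    # fill in the missing values
  step_log(bm) |>                                  # log-transform body mass
  step_normalize(all_numeric_predictors()) |>      # centre and scale
  step_pca(all_numeric_predictors(), num_comp = 2) # two principal components
rec |> prep() |> bake(new_data = NULL) |> slice_head(n = 3)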

The big picture

  1. Know your data
    • Categorical response or no response
    • Types of predictors: quantitative, categorical
    • Independent observations
    • Do you need to handle missing values?
    • Are there anomalous observations?
  2. Plot your data
    • What are the shapes (distribution and variance)?
    • Are there gaps or separations (centres)?
  3. Fit a model or two
    • Compute fit statistics
    • Plot the model
    • Examine parameter estimates
  4. Diagnostics
    • Which is the better model?
    • Is there a simpler model?
    • Are the errors reducible or systematic?
    • Are you confident that your bias is low and variance is low?

Next: Visualisation